Validation Issues induced by an Automatic Pre-Annotation Mechanism in the Building of Non-projective Dependency Treebanks

نویسندگان

  • Ophélie Lacroix
  • Denis Béchet
چکیده

In order to build large dependency treebanks using the CDG Lab, a grammar-based dependency treebank development tool, an annotator usually has to fill a selection form before parsing. This step is usually necessary because, otherwise, the search space is too big for long sentences and the parser fails to produce at least one solution. With the information given by the annotator on the selection form the parser can produce one or several dependency structures and the annotator can proceed by adding positive or negative annotations on dependencies and launching iteratively the parser until the right dependency structure has been found. However, the selection form is sometimes difficult and long to fill because the annotator must have an idea of the result before parsing. The CDG Lab proposes to replace this form by an automatic pre-annotation mechanism. However, this model introduces some issues during the annotation phase that do not exist when the annotator uses a selection form. The article presents those issues and proposes some modifications of the CDG Lab in order to use effectively the automatic pre-annotation mechanism.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

An annotation scheme for Persian based on Autonomous Phrases Theory and Universal Dependencies

A treebank is a corpus with linguistic annotations above the level of the parts of speech. During the first half of the present decade, three treebanks have been developed for Persian either originally or subsequently based on dependency grammar: Persian Treebank (PerTreeBank), Persian Syntactic Dependency Treebank, and Uppsala Persian Dependency Treebank (UPDT). The syntactic analysis of a sen...

متن کامل

Label Pre-annotation for Building Non-projective Dependency Treebanks for French

The current interest in accurate dependency parsing make it necessary to build dependency treebanks for French containing both projective and non-projective dependencies. In order to alleviate the work of the annotator, we propose to automatically pre-annotate the sentences with the labels of the dependencies ending on the words. The selection of the dependency labels reduces the ambiguity of t...

متن کامل

On Detecting Errors in Dependency Treebanks

Dependency relations between words are increasingly recognized as an important level of linguistic representation that is close to the data and at the same time to the semantic functor-argument structure as a target of syntactic analysis and processing. Correspondingly, dependency structures play an important role in parser evaluation and for the training and evaluation of tools based on depend...

متن کامل

Building gold-standard treebanks for Norwegian

Språkbanken at the National Library of Norway is currently building up gold-standard Dependency Grammar treebanks for Norwegian Bokmål and Nynorsk. The treebanks are manually annotated for morphological features, syntactic functions and dependency relations. This paper explains the choice of texts and format of the treebanks, some key aspects of the morphological and syntactic annotation, and i...

متن کامل

Automatic Error Detection in Annotated Corpora

Annotated corpus is a linguistic resource which explicitly encodes the information at syntactic and semantic levels for each sentence. Annotated corpora play a crucial role in many applications of natural language processing (NLP). Error free and consistent annotated corpora is vital for these applications. Creating annotated corpora is an expensive and time consuming process. Errors or anomali...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2014